ABSTRACT


In this paper it is shown that the classical maximum likelihood principle can be
considered to be a method of asymptotic realization of an optimum estimate with
respect to a very general information theoretic criterion. This observation allows an
extension of the principle to provide answers to many practical problems of statistical
model fitting.


1. INTRODUCTION 


The extension of the maximum likelihood principle which we are proposing
in this paper was first announced by the author in a recent paper [6] in the
following form:

Given a set of estimates $\hat\theta$ of the vector of parameters $\theta$ of a probability
distribution with density function $f(x \mid \theta)$ we adopt as our final estimate
the one which will give the maximum of the expected log-likelihood, which
is by definition

$$E \log f(X \mid \hat\theta) = E \int f(x \mid \theta) \log f(x \mid \hat\theta)\, dx, \tag{1.1}$$

where $X$ is a random variable following the distribution with the density
function $f(x \mid \theta)$ and is independent of $\hat\theta$.

This seems to be a formal extension of the classical maximum likelihood
principle, but a simple reflection shows that it is equivalent to maximizing
an information theoretic quantity which is given by the definition

$$E \log \frac{f(X \mid \hat\theta)}{f(X \mid \theta)} = E \int f(x \mid \theta) \log \frac{f(x \mid \hat\theta)}{f(x \mid \theta)}\, dx. \tag{1.2}$$

The integral in the right-hand side of the above equation gives, apart from its sign,
the Kullback-Leibler mean information for discrimination between $f(x \mid \hat\theta)$ and
$f(x \mid \theta)$, which is known to give a measure of separation or distance between the two
distributions [15]. This observation makes it clear that what we are proposing
here is the adoption of an information theoretic measure of the discrepancy
between the estimated and the true probability distributions to define
the loss function of an estimate $\hat\theta$ of $\theta$. It is well recognized that
statistical estimation theory should and can be organized within the framework of the
theory of statistical decision functions [25]. The only difficulty in realizing
this is the choice of a proper loss function, a point which is discussed in
detail in a paper by Le Cam [17].
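
The equivalence between maximizing the expected log-likelihood (1.1) and minimizing
the Kullback-Leibler information can be checked on a small case. The following is a
minimal sketch only, assuming a unit-variance Gaussian family with a known true mean;
the helper names `expected_loglik` and `kl_information` are hypothetical and do not
come from the paper.

```python
import math

def expected_loglik(mu_hat, mu_true=0.0, sigma=1.0):
    """E log f(X | mu_hat) when X ~ N(mu_true, sigma^2) and the candidate is N(mu_hat, sigma^2)."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (sigma ** 2 + (mu_true - mu_hat) ** 2) / (2 * sigma ** 2))

def kl_information(mu_hat, mu_true=0.0, sigma=1.0):
    """Kullback-Leibler information of the true N(mu_true, sigma^2) against the candidate."""
    return (mu_true - mu_hat) ** 2 / (2 * sigma ** 2)

# Hypothetical candidate estimates of the mean (assumed values for illustration).
candidates = [-0.8, 0.3, 1.5]
best_by_loglik = max(candidates, key=expected_loglik)
best_by_kl = min(candidates, key=kl_information)
```

In this assumed family the two criteria differ only by a term that does not depend on
the candidate, so they select the same estimate.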

In the following sections it will be shown that our present choice of the
information theoretic loss function is a very natural and reasonable one for
developing a unified asymptotic theory of estimation. We will first discuss the
definition of the amount of information and make clear the relative merit,
in relation to the asymptotic estimation theory, of the Kullback-Leibler
type information among the infinitely many possible alternatives. The
discussion will reveal that the log-likelihood is essentially a more natural
quantity than the simple likelihood to be used for the definition of the
maximum likelihood principle.

Our extended maximum likelihood principle can most effectively be
applied to the decision of the final estimate of a finite parameter model when
many alternative maximum likelihood estimates are obtained corresponding
to the various restrictions of the model. The log-likelihood ratio statistics
developed for the test of composite hypotheses can most conveniently
be used for this purpose, and this use reveals the truly statistical nature of the
information theoretic quantities which have often been considered to be
probabilistic rather than statistical [21].

With the aid of these log-likelihood ratio statistics our extended maximum
likelihood principle can provide solutions for various important practical
problems which have hitherto been treated as problems of statistical
hypothesis testing rather than of statistical decision or estimation. Among the
possible applications are the decisions on the number of factors in
factor analysis, on the significant factors in the analysis of variance, on the
number of independent variables to be included in a multiple regression and
on the order of autoregressive and other finite parameter models of stationary
time series.

Numerical examples are given to illustrate the difference between our present
approach and the conventional procedure of successive applications of
statistical tests for the determination of the order of autoregressive models.
The results will convincingly suggest that our new approach will eventually
replace many of the conventional statistical procedures developed hitherto.


2. INFORMATION AND DISCRIMINATION 


It can be shown [9] that for the purpose of discrimination between the
two probability distributions with density functions $f_i(x)$ ($i = 0, 1$) all the
necessary information is contained in the likelihood ratio
$T(x) = f_1(x)/f_0(x)$ in the sense that any decision procedure with a prescribed loss of
discriminating the two distributions based on a realization of a sample point
$x$ can, if it is realizable at all, equivalently be realized through the use of
$T(x)$. If we consider that the information supplied by observing a realization
of a (set of) random variable(s) is essentially summarized in its effect of
leading us to the discrimination of various hypotheses, it will be reasonable
to assume that the amount of information obtained by observing a
realization $x$ must be a function of $T(x) = f_1(x)/f_0(x)$.

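Following the above observation, the natural definition of the mean amount
of information for discrimination per observation when the actual
distribution is $f_0(x)$ will be given by

$$I(f_1, f_0; \Phi) = \int \Phi\!\left(\frac{f_1(x)}{f_0(x)}\right) f_0(x)\, dx, \tag{2.1}$$

where $\Phi(r)$ is a properly chosen function of $r$ and $dx$ denotes the measure
with respect to which $f_i(x)$ are defined. We shall hereafter be concerned with
the parametric situation where the densities are specified by a set of
parameters $\theta$ in the form

$$f(x) = f(x \mid \theta), \tag{2.2}$$

where it is assumed that $\theta$ is an $L$-dimensional vector,
$\theta = (\theta_1, \theta_2, \ldots, \theta_L)'$, where $'$ denotes the transpose. We assume
that the true distribution under observation is specified by
$\theta = \vartheta = (\vartheta_1, \vartheta_2, \ldots, \vartheta_L)'$. We will denote by
$I(\theta, \vartheta; \Phi)$ the quantity defined by (2.1) with $f_1(x) = f(x \mid \theta)$ and
$f_0(x) = f(x \mid \vartheta)$ and analyze the sensitivity of $I(\theta, \vartheta; \Phi)$ to the
deviation of $\theta$ from $\vartheta$. Assuming the regularity conditions of $f(x \mid \theta)$ and
$\Phi(r)$ which assure the following analytical treatment we get

$$\left.\frac{\partial}{\partial\theta_l} I(\theta, \vartheta; \Phi)\right|_{\theta=\vartheta}
= \dot\Phi(1) \int \left[\frac{\partial f_\theta}{\partial\theta_l}\right]_{\theta=\vartheta} dx, \tag{2.3}$$

$$\left.\frac{\partial^2}{\partial\theta_l\,\partial\theta_m} I(\theta, \vartheta; \Phi)\right|_{\theta=\vartheta}
= \ddot\Phi(1) \int \left[\left(\frac{\partial f_\theta}{\partial\theta_l}\frac{1}{f_\theta}\right)
\left(\frac{\partial f_\theta}{\partial\theta_m}\frac{1}{f_\theta}\right)\right]_{\theta=\vartheta} f_\vartheta\, dx
+ \dot\Phi(1) \int \left[\frac{\partial^2 f_\theta}{\partial\theta_l\,\partial\theta_m}\right]_{\theta=\vartheta} dx, \tag{2.4}$$

where $r$, $\dot\Phi(1)$, $\ddot\Phi(1)$ and $f_\theta$ denote $f(x \mid \theta)/f(x \mid \vartheta)$,
$\left. d\Phi(r)/dr \right|_{r=1}$, $\left. d^2\Phi(r)/dr^2 \right|_{r=1}$ and $f(x \mid \theta)$,
respectively, and the meaning of the other quantities will be clear from
the context. Taking into account that we are assuming the validity of
differentiation under the integral sign and that $\int f(x \mid \theta)\, dx = 1$, we have

$$\int \left[\frac{\partial f_\theta}{\partial\theta_l}\right]_{\theta=\vartheta} dx
= \int \left[\frac{\partial^2 f_\theta}{\partial\theta_l\,\partial\theta_m}\right]_{\theta=\vartheta} dx = 0. \tag{2.5}$$

Thus we get

$$I(\vartheta, \vartheta; \Phi) = \Phi(1), \tag{2.6}$$

$$\left.\frac{\partial}{\partial\theta_l} I(\theta, \vartheta; \Phi)\right|_{\theta=\vartheta} = 0, \tag{2.7}$$

$$\left.\frac{\partial^2}{\partial\theta_l\,\partial\theta_m} I(\theta, \vartheta; \Phi)\right|_{\theta=\vartheta}
= \ddot\Phi(1) \int \left[\left(\frac{\partial f_\theta}{\partial\theta_l}\frac{1}{f_\theta}\right)
\left(\frac{\partial f_\theta}{\partial\theta_m}\frac{1}{f_\theta}\right)\right]_{\theta=\vartheta} f_\vartheta\, dx. \tag{2.8}$$

These relations show that $\ddot\Phi(1)$ must be different from zero if $I(\theta, \vartheta; \Phi)$
is to be sensitive to the small variations of $\theta$. Also it is clear that the
relative sensitivity of $I(\theta, \vartheta; \Phi)$ is high when $|\ddot\Phi(1)/\Phi(1)|$ is large.
This will be the case when $\Phi(1) = 0$. The integral on the right-hand side of (2.8)
defines the $(l, m)$th element of Fisher's information matrix [16] and the above results
show that this matrix is playing a central role in determining the behaviour
of our mean information $I(\theta, \vartheta; \Phi)$ for small variations of $\theta$ around
$\vartheta$. The possible forms of $\Phi(r)$ are e.g. $\log r$, $(r - 1)^2$ and $r^{1/2}$ and we
cannot decide uniquely at this stage.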
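
As a worked illustration of (2.6) to (2.8), which is not part of the original derivation,
consider the unit-variance normal location family. This choice is an assumption made
here only for concreteness: $f(x \mid \theta)$ is the $N(\theta, 1)$ density, Fisher's
information equals 1, and $d = \theta - \vartheta$.

```latex
% Assumed example: f(x | \theta) is the N(\theta, 1) density, d = \theta - \vartheta,
% and x is distributed as N(\vartheta, 1), so that z = x - \vartheta is standard normal.
\begin{align*}
\log r &= d\,(x - \vartheta) - \tfrac{1}{2} d^{2}, \\[4pt]
\Phi(r) = \log r:\qquad
  I(\theta, \vartheta; \Phi) &= E\,[\,d z - \tfrac{1}{2} d^{2}\,] = -\tfrac{1}{2} d^{2}, \\
  \left.\frac{\partial^{2}}{\partial\theta^{2}} I(\theta, \vartheta; \Phi)\right|_{\theta=\vartheta}
  &= -1 = \ddot\Phi(1) \cdot 1, \\[4pt]
\Phi(r) = (r - 1)^{2}:\qquad
  I(\theta, \vartheta; \Phi) &= E\,[r^{2}] - 1 = e^{d^{2}} - 1, \\
  \left.\frac{\partial^{2}}{\partial\theta^{2}} I(\theta, \vartheta; \Phi)\right|_{\theta=\vartheta}
  &= 2 = \ddot\Phi(1) \cdot 1 .
\end{align*}
```

In both cases $\Phi(1) = 0$, the first derivative vanishes at $\theta = \vartheta$, and the
second derivative is $\ddot\Phi(1)$ times Fisher's information, as (2.6) to (2.8) state.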

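To restrict further the form of $\Phi(r)$ we consider the effect of the increase
of information by $N$ independent observations of $X$. For this case we have
to consider the quantity

$$I_N(\theta, \vartheta; \Phi) = \int \Phi\!\left(\frac{\prod_{i=1}^{N} f(x_i \mid \theta)}{\prod_{i=1}^{N} f(x_i \mid \vartheta)}\right)
\prod_{i=1}^{N} f(x_i \mid \vartheta)\, dx_1 \cdots dx_N. \tag{2.9}$$

Corresponding to (2.6), (2.7) and (2.8) we have

$$I_N(\vartheta, \vartheta; \Phi) = I(\vartheta, \vartheta; \Phi), \tag{2.10}$$

$$\left.\frac{\partial}{\partial\theta_l} I_N(\theta, \vartheta; \Phi)\right|_{\theta=\vartheta} = 0, \tag{2.11}$$

$$\left.\frac{\partial^2}{\partial\theta_l\,\partial\theta_m} I_N(\theta, \vartheta; \Phi)\right|_{\theta=\vartheta}
= N \left.\frac{\partial^2}{\partial\theta_l\,\partial\theta_m} I(\theta, \vartheta; \Phi)\right|_{\theta=\vartheta}. \tag{2.12}$$

These equations show that $I_N(\vartheta, \vartheta; \Phi)$ is not responsive to the increase of
information and that
$\left.\frac{\partial^2}{\partial\theta_l\,\partial\theta_m} I_N(\theta, \vartheta; \Phi)\right|_{\theta=\vartheta}$
is in a linear relation with $N$. It can be seen that only the quantity defined by

$$\left.\frac{\partial}{\partial\theta_l}\left[\prod_{i=1}^{N} f(x_i \mid \theta)\right]
\frac{1}{\prod_{i=1}^{N} f(x_i \mid \vartheta)}\right|_{\theta=\vartheta}
= \sum_{i=1}^{N} \left.\frac{\partial f(x_i \mid \theta)}{\partial\theta_l}\,\frac{1}{f(x_i \mid \theta)}\right|_{\theta=\vartheta} \tag{2.13}$$

is concerned with the derivation of this last relation. This shows very clearly
that, taking into account the relation

$$\frac{\partial f(x \mid \theta)}{\partial\theta_l}\,\frac{1}{f(x \mid \theta)} = \frac{\partial \log f(x \mid \theta)}{\partial\theta_l}, \tag{2.14}$$

the functions $\frac{\partial}{\partial\theta_l} \log f(x \mid \theta)$ are playing the central role in the
present definition of information. This observation suggests the adoption of $\Phi(r) = \log r$
for the definition of our amount of information and we are very naturally
led to the use of Kullback-Leibler's definition of information for the purpose
of our present study.

It should be noted here that at least asymptotically any other definition
of $\Phi(r)$ will be useful if only $\ddot\Phi(1)$ is not vanishing. The main point of our
present observation will rather be the recognition of the essential role being
played by the functions $\frac{\partial}{\partial\theta_l} \log f(x \mid \theta)$ for the definition of the
mean information for the discrimination of the distributions corresponding to the
small deviations of $\theta$ from $\vartheta$.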
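
One further way to see why $\Phi(r) = \log r$ is convenient, offered here as a short
supplementary derivation rather than as a step of the original argument: the logarithm
turns the product of likelihood ratios in (2.9) into a sum, so the mean information of
$N$ independent observations is exactly $N$ times that of one observation, for every
$\theta$ and not only to second order.

```latex
% \Phi(r) = \log r applied to the product form (2.9).
\begin{align*}
I_N(\theta, \vartheta; \log)
  &= \int \Bigl[\, \sum_{i=1}^{N} \log \frac{f(x_i \mid \theta)}{f(x_i \mid \vartheta)} \Bigr]
     \prod_{j=1}^{N} f(x_j \mid \vartheta)\, dx_1 \cdots dx_N \\
  &= \sum_{i=1}^{N} \int \log \frac{f(x_i \mid \theta)}{f(x_i \mid \vartheta)}\,
     f(x_i \mid \vartheta)\, dx_i
   \;=\; N\, I(\theta, \vartheta; \log).
\end{align*}
```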


3. INFORMATION AND THE MAXIMUM LIKELIHOOD PRINCIPLE 
Since the purpose of estimating the parameters of $f(x \mid \theta)$ is to base our
decision on $f(x \mid \hat\theta)$, where $\hat\theta$ is an estimate of $\theta$, the discussion in the
preceding section suggests the adoption of the following loss and risk functions:

$$W(\vartheta, \hat\theta) = (-2) \int f(x \mid \vartheta) \log \frac{f(x \mid \hat\theta)}{f(x \mid \vartheta)}\, dx, \tag{3.1}$$

$$R(\vartheta, \hat\theta) = E\, W(\vartheta, \hat\theta), \tag{3.2}$$

where the expectation in the right-hand side of (3.2) is taken with respect to
the distribution of $\hat\theta$. As $W(\vartheta, \hat\theta)$ is equal to 2 times the Kullback-Leibler
information for discrimination in favour of $f(x \mid \vartheta)$ against $f(x \mid \hat\theta)$, it is
known that $W(\vartheta, \hat\theta)$ is a non-negative quantity and is equal to zero if and only if
$f(x \mid \vartheta) = f(x \mid \hat\theta)$ almost everywhere [16]. This property forms a basis of the
proof of consistency of the maximum likelihood estimate of $\theta$ [24] and
indicates the close relationship between the maximum likelihood principle and
the information theoretic observations.

When $N$ independent realizations $x_i$ ($i = 1, 2, \ldots, N$) of $X$ are available,
$(-2)$ times the sample mean of the log-likelihood ratio

$$\frac{1}{N} \sum_{i=1}^{N} \log \frac{f(x_i \mid \hat\theta)}{f(x_i \mid \vartheta)} \tag{3.3}$$

will be a consistent estimate of $W(\vartheta, \hat\theta)$. Thus it is quite natural to expect
that, at least for large $N$, the value of $\hat\theta$ which will give the maximum of
(3.3) will nearly minimize $W(\vartheta, \hat\theta)$. Fortunately the maximization of (3.3)
can be realized without knowing the true value of $\vartheta$, giving the well-known
maximum likelihood estimate $\hat\theta$. Though it has been said that the maximum
likelihood principle is not based on any clearly defined optimum
consideration [18; p. 15], our present observation has made it clear that it is
essentially designed to keep minimum the estimated loss function which is very
naturally defined as the mean information for discrimination between the
estimated and the true distributions.
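
The relation between (3.1) and (3.3) can be tried out numerically. The sketch below is an
illustration under stated assumptions, not the author's procedure: it assumes $N(\theta, 1)$
densities, a hypothetical true value `theta_true`, simulated data with a fixed seed, and the
hypothetical helper names `estimated_loss` and `true_loss`.

```python
import random

def estimated_loss(xs, theta_hat, theta_true):
    """(-2) times the sample mean of the log-likelihood ratio (3.3) for N(theta, 1) densities."""
    # log f(x | theta_hat) - log f(x | theta_true) = [(x - theta_true)^2 - (x - theta_hat)^2] / 2
    return sum((x - theta_hat) ** 2 - (x - theta_true) ** 2 for x in xs) / len(xs)

def true_loss(theta_hat, theta_true):
    """W of (3.1) for N(theta, 1) densities: twice the Kullback-Leibler information."""
    return (theta_hat - theta_true) ** 2

random.seed(0)                        # fixed seed, for reproducibility only
theta_true = 1.0                      # hypothetical true parameter
xs = [random.gauss(theta_true, 1.0) for _ in range(2000)]
mle = sum(xs) / len(xs)               # maximum likelihood estimate of the mean

# The estimated loss is minimized at the maximum likelihood estimate.
grid = [mle + 0.01 * j for j in range(-50, 51)]
best_on_grid = min(grid, key=lambda t: estimated_loss(xs, t, theta_true))
```

In this assumed family the estimated loss evaluated at the maximum likelihood estimate
equals the negative of the true loss, so the fitted value looks better than it is. This is
the kind of optimism that a correction term has to compensate when estimates from
different restricted models are compared.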


4. EXTENSION OF THE MAXIMUM LIKELIHOOD PRINCIPLE


The maximum likelihood principle has mainly been utilized in two
different branches of statistical theories. The first is the estimation theory where
the method of maximum likelihood has been used extensively and the
second is the test theory where the log-likelihood ratio statistic is playing
a very important role. Our present definitions of $W(\vartheta, \hat\theta)$ and
$R(\vartheta, \hat\theta)$ suggest that these two problems should be combined into a single
problem of statistical decision. Thus instead of considering a single estimate of
$\vartheta$ we consider estimates corresponding to various possible restrictions of the
distribution, and instead of treating the problem as a multiple decision or a test between
hypotheses we treat it as a problem of general estimation procedure based
on the decision theoretic consideration. This whole idea can be very simply
realized by comparing $R(\vartheta, \hat\theta)$, or $W(\vartheta, \hat\theta)$ if possible, for various
$\hat\theta$'s and taking the one with the minimum of $R(\vartheta, \hat\theta)$ or
$W(\vartheta, \hat\theta)$ as our final choice.
As was discussed in the introduction, this approach may be viewed as a
natural extension of the classical maximum likelihood principle. The only
problem in applying this extended principle in a practical situation is how
to get reliable estimates of $R(\vartheta, \hat\theta)$ or $W(\vartheta, \hat\theta)$. As was noticed
in [6] and will be seen shortly, this can be done for a very interesting and
practically important situation of composite hypotheses through the use of the
maximum likelihood estimates and the corresponding log-likelihood ratio
statistics.

The problem of statistical model identification is often formulated as the
problem of the selection of $f(x \mid {}_k\theta)$ ($k = 0, 1, 2, \ldots, L$) based on the
observations of $X$, where ${}_k\theta$ is restricted to the space with
${}_k\theta_{k+1} = {}_k\theta_{k+2} = \cdots = {}_k\theta_L = 0$. $k$, or some of its equivalents,
is often called the order of the model. Its decision is usually the most difficult
problem in practical statistical model identification. The problem has often been
treated as a subject of composite hypothesis testing and the use of the log-likelihood
ratio criterion is well established for this purpose [23]. We consider the situation
where the results $x_i$ ($i = 1, 2, \ldots, N$) of $N$ independent observations of
$X$ have been obtained. We denote by ${}_k\hat\theta$ the maximum likelihood estimate
in the space of ${}_k\theta$, i.e., ${}_k\hat\theta$ is the value of ${}_k\theta$ which gives the
maximum of the likelihood function $\prod_{i=1}^{N} f(x_i \mid {}_k\theta)$. The observation at the
end of the preceding section strongly suggests the use of

$${}_k\omega_L = -\frac{2}{N} \sum_{i=1}^{N} \log \frac{f(x_i \mid {}_k\hat\theta)}{f(x_i \mid {}_L\hat\theta)} \tag{4.1}$$

as an estimate of $W(\vartheta, {}_k\hat\theta)$. The statistic

$${}_k\eta_L = N \times {}_k\omega_L \tag{4.2}$$

is the familiar log-likelihood ratio test statistic which will asymptotically
be distributed as a chi-square variable with the degrees of freedom equal to
$L - k$ when the true parameter $\vartheta$ is in the space of ${}_k\theta$. If we define

$$W(\vartheta, {}_k\vartheta) = \inf_{{}_k\theta} W(\vartheta, {}_k\theta), \tag{4.3}$$

then it is expected that

$${}_k\omega_L \to W(\vartheta, {}_k\vartheta) \quad \text{w.p.1}.$$


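Thus when $N\,W(\vartheta, {}_k\vartheta)$ is significantly larger than $L$ the value of
${}_k\eta_L$ will be very much larger than would be expected from the chi-square
approximation. The only situation where a precise analysis of the behaviour of
${}_k\eta_L$ is necessary would be the case where $N\,W(\vartheta, {}_k\vartheta)$ is of comparable
order of magnitude with $L$. When $N$ is very large compared with $L$ this means that
$W(\vartheta, {}_k\vartheta)$ is very nearly equal to $W(\vartheta, \vartheta) = 0$. We shall hereafter
assume that $W(\vartheta, \theta)$ is sufficiently smooth at $\theta = \vartheta$ and

$$W(\vartheta, \theta) > 0 \quad \text{for } \theta \neq \vartheta. \tag{4.4}$$

Also we assume that $W(\vartheta, {}_k\theta)$ has a unique minimum at ${}_k\theta = {}_k\vartheta$ and
that ${}_L\vartheta = \vartheta$. Under these assumptions the maximum likelihood estimates
$\hat\theta$ and ${}_k\hat\theta$ will be consistent estimates of $\vartheta$ and ${}_k\vartheta$,
respectively, and since we are concerned with the situation where $\vartheta$ and
${}_k\vartheta$ are situated very near to each other, we limit our observation only up to the
second-order variation of $W(\vartheta, {}_k\hat\theta)$. Thus hereafter we adopt, in place of
$W(\vartheta, {}_k\hat\theta)$, the loss function

$$W_2(\vartheta, {}_k\hat\theta) = \sum_{l=1}^{L} \sum_{m=1}^{L}
({}_k\hat\theta_l - \vartheta_l)({}_k\hat\theta_m - \vartheta_m)\, C(l, m)(\vartheta), \tag{4.5}$$

where $C(l, m)(\theta)$ is the $(l, m)$th element of Fisher's information matrix and
is given by

$$C(l, m)(\theta) = \int \left(\frac{\partial f_\theta}{\partial\theta_l}\frac{1}{f_\theta}\right)
\left(\frac{\partial f_\theta}{\partial\theta_m}\frac{1}{f_\theta}\right) f_\theta\, dx
= -\int \frac{\partial^2 \log f_\theta}{\partial\theta_l\,\partial\theta_m}\, f_\theta\, dx. \tag{4.6}$$

We shall simply denote by $C(l, m)$ the value of $C(l, m)(\theta)$ at $\theta = \vartheta$. We denote
by $\|\theta\|_c$ the norm in the space of $\theta$ defined by

$$\|\theta\|_c^2 = \sum_{l=1}^{L} \sum_{m=1}^{L} \theta_l\, \theta_m\, C(l, m). \tag{4.7}$$

We have

$$W_2(\vartheta, {}_k\hat\theta) = \|{}_k\hat\theta - \vartheta\|_c^2. \tag{4.8}$$

Also we redefine ${}_k\vartheta$ by the relation

$$\|{}_k\vartheta - \vartheta\|_c^2 = \min_{{}_k\theta} \|{}_k\theta - \vartheta\|_c^2. \tag{4.9}$$

Thus ${}_k\vartheta$ is the projection of $\vartheta$ in the space of ${}_k\theta$'s with respect to
the metric defined by $C(l, m)$ and is given by the relations

$$\sum_{m=1}^{k} C(l, m)\, {}_k\vartheta_m = \sum_{m=1}^{L} C(l, m)\, \vartheta_m, \qquad l = 1, 2, \ldots, k. \tag{4.10}$$

We get from (4.8) and (4.9)

$$W_2(\vartheta, {}_k\hat\theta) = \|{}_k\vartheta - \vartheta\|_c^2 + \|{}_k\hat\theta - {}_k\vartheta\|_c^2. \tag{4.11}$$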
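
The projection (4.10) and the decomposition (4.11) can be checked on a small numerical
case. The following sketch assumes a hypothetical $2 \times 2$ Fisher information matrix
`C`, a hypothetical true parameter `theta_true`, and a restricted model with $k = 1$;
none of these values comes from the paper.

```python
def c_inner(u, v, C):
    """Inner product (u, v)_c = sum over l, m of u_l * v_m * C(l, m)."""
    return sum(u[l] * v[m] * C[l][m] for l in range(len(u)) for m in range(len(v)))

def c_norm2(u, C):
    """Squared norm of (4.7)."""
    return c_inner(u, u, C)

C = [[2.0, 1.0], [1.0, 3.0]]      # hypothetical Fisher information matrix, L = 2
theta_true = [1.0, 0.5]           # hypothetical true parameter

# (4.10) with k = 1: C(1,1) * proj_1 = C(1,1) * theta_1 + C(1,2) * theta_2
proj = [(C[0][0] * theta_true[0] + C[0][1] * theta_true[1]) / C[0][0], 0.0]

estimate = [0.4, 0.0]             # any point of the restricted space (second component zero)
lhs = c_norm2([e - t for e, t in zip(estimate, theta_true)], C)
rhs = (c_norm2([p - t for p, t in zip(proj, theta_true)], C)
       + c_norm2([e - p for e, p in zip(estimate, proj)], C))
```

Here `lhs` plays the role of the left-hand side of (4.11) and `rhs` that of its right-hand
side. The two agree because the residual of the projection is orthogonal, in the metric of
`C`, to every vector of the restricted space.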


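Since the definition of $W(\vartheta, \hat\theta)$ strongly suggests, and is actually motivated
by, the use of the log-likelihood ratio statistic we will study the possible
use of this statistic for the estimation of $W_2(\vartheta, {}_k\hat\theta)$. Taking into account the
relations

$$\begin{aligned}
\sum_{i=1}^{N} \frac{\partial \log f(x_i \mid \hat\theta)}{\partial\theta_m} &= 0, \qquad m = 1, 2, \ldots, L, \\
\sum_{i=1}^{N} \frac{\partial \log f(x_i \mid {}_k\hat\theta)}{\partial\theta_m} &= 0, \qquad m = 1, 2, \ldots, k,
\end{aligned} \tag{4.12}$$

we get the Taylor expansions

$$\begin{aligned}
\sum_{i=1}^{N} \log f(x_i \mid {}_k\vartheta)
&= \sum_{i=1}^{N} \log f(x_i \mid \hat\theta)
 + \frac{1}{2} \sum_{m=1}^{L} \sum_{l=1}^{L} N ({}_k\vartheta_m - \hat\theta_m)({}_k\vartheta_l - \hat\theta_l)
 \times \frac{1}{N} \sum_{i=1}^{N} \frac{\partial^2 \log f(x_i \mid \hat\theta + \rho({}_k\vartheta - \hat\theta))}{\partial\theta_m\,\partial\theta_l} \\
&= \sum_{i=1}^{N} \log f(x_i \mid {}_k\hat\theta)
 + \frac{1}{2} \sum_{m=1}^{k} \sum_{l=1}^{k} N ({}_k\vartheta_m - {}_k\hat\theta_m)({}_k\vartheta_l - {}_k\hat\theta_l)
 \times \frac{1}{N} \sum_{i=1}^{N} \frac{\partial^2 \log f(x_i \mid {}_k\hat\theta + \rho_k({}_k\vartheta - {}_k\hat\theta))}{\partial\theta_m\,\partial\theta_l},
\end{aligned} \tag{4.13}$$

where the parameter values within the functions under the differential sign
denote the points where the derivatives are taken and $0 \le \rho_k, \rho \le 1$, a
convention which we use in the rest of this paper. We consider that, in
increasing the value of $N$, $N$ and $k$ are chosen in such a way that
$\sqrt{N}({}_k\vartheta_m - \vartheta_m)$ ($m = 1, 2, \ldots, L$) are bounded, or rather tending to a
set of constants for the ease of explanation. Under this circumstance, assuming the tendency
towards a Gaussian distribution of $\sqrt{N}(\hat\theta - \vartheta)$ and the consistency of
${}_k\hat\theta$ and $\hat\theta$ as the estimates of ${}_k\vartheta$ and $\vartheta$ we get, from (4.6) and
(4.13), an asymptotic equality in distribution for the log-likelihood ratio statistic
${}_k\eta_L$ of (4.2)

$${}_k\eta_L = N \|\hat\theta - {}_k\vartheta\|_c^2 - N \|{}_k\hat\theta - {}_k\vartheta\|_c^2. \tag{4.14}$$

By simple manipulation

$${}_k\eta_L = N \|{}_k\vartheta - \vartheta\|_c^2 + N \|\hat\theta - \vartheta\|_c^2
- N \|{}_k\hat\theta - {}_k\vartheta\|_c^2 - 2N (\hat\theta - \vartheta,\, {}_k\vartheta - \vartheta)_c, \tag{4.15}$$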

where $(\,,)_c$ denotes the inner product defined by $C(l, m)$. Assuming the
validity of the Taylor expansion up to the second order and taking into account
the relations (4.12) we get for $l = 1, 2, \ldots, k$

$$\begin{aligned}
\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \frac{\partial \log f(x_i \mid {}_k\vartheta)}{\partial\theta_l}
&= \sum_{m=1}^{L} \sqrt{N} ({}_k\vartheta_m - \hat\theta_m)\,
 \frac{1}{N} \sum_{i=1}^{N} \frac{\partial^2 \log f(x_i \mid \hat\theta + \rho({}_k\vartheta - \hat\theta))}{\partial\theta_m\,\partial\theta_l} \\
&= \sum_{m=1}^{k} \sqrt{N} ({}_k\vartheta_m - {}_k\hat\theta_m)\,
 \frac{1}{N} \sum_{i=1}^{N} \frac{\partial^2 \log f(x_i \mid {}_k\hat\theta + \rho_k({}_k\vartheta - {}_k\hat\theta))}{\partial\theta_m\,\partial\theta_l}.
\end{aligned} \tag{4.16}$$

Let $C^{-1}$ be the inverse of Fisher's information matrix. Assuming the tendency
to the Gaussian distribution $N(0, C^{-1})$ of the distribution of $\sqrt{N}(\hat\theta - \vartheta)$,
which can be derived by using the Taylor expansion of the type of (4.16)
at $\theta = \vartheta$, we can see that for $N$ and $k$ with bounded
$\sqrt{N}({}_k\vartheta_m - \vartheta_m)$ ($m = 1, 2, \ldots, L$) (4.16) yields, under the smoothness
assumption of $C(l, m)(\theta)$ at $\theta = \vartheta$, the approximate equations

$$\sum_{m=1}^{L} \sqrt{N} ({}_k\vartheta_m - \hat\theta_m)\, C(l, m)
= \sum_{m=1}^{k} \sqrt{N} ({}_k\vartheta_m - {}_k\hat\theta_m)\, C(l, m), \qquad l = 1, 2, \ldots, k. \tag{4.17}$$

Taking (4.10) into account we get from (4.17), for $l = 1, 2, \ldots, k$,

$$\sum_{m=1}^{k} \sqrt{N} ({}_k\hat\theta_m - {}_k\vartheta_m)\, C(l, m)
= \sum_{m=1}^{L} \sqrt{N} (\hat\theta_m - \vartheta_m)\, C(l, m). \tag{4.18}$$

This shows that geometrically ${}_k\hat\theta - {}_k\vartheta$ is (approximately) the projection of
$\hat\theta - \vartheta$ into the space of ${}_k\theta$'s. From this result it can be shown that
$N \|\hat\theta - \vartheta\|_c^2 - N \|{}_k\hat\theta - {}_k\vartheta\|_c^2$ and
$N \|{}_k\hat\theta - {}_k\vartheta\|_c^2$ are asymptotically independently distributed as chi-square
variables with the degrees of freedom $L - k$ and $k$, respectively. It can also be shown that
the standard deviation of the asymptotic distribution of
$N (\hat\theta - \vartheta,\, {}_k\vartheta - \vartheta)_c$ is equal to $\sqrt{N}\, \|{}_k\vartheta - \vartheta\|_c$.
Thus if $N \|{}_k\vartheta - \vartheta\|_c^2$ is of comparable magnitude with $L - k$ or $k$ and
these are large integers then the contribution of the last term in the right-hand
side of (4.15) remains relatively insignificant. If $N \|{}_k\vartheta - \vartheta\|_c^2$ is
significantly larger than $L$ the contribution of
$N (\hat\theta - \vartheta,\, {}_k\vartheta - \vartheta)_c$ to ${}_k\eta_L$
will also be relatively insignificant. If $N \|{}_k\vartheta - \vartheta\|_c^2$ is significantly smaller
than $L$ and $k$, again the contribution of $N (\hat\theta - \vartheta,\, {}_k\vartheta - \vartheta)_c$ will
remain insignificant compared with those of other variables of chi-square type. These
observations suggest that from (4.11), though $N^{-1}\, {}_k\eta_L$ may not be a good
estimate of $W_2(\vartheta, {}_k\hat\theta)$,

$$r(\hat\theta, {}_k\hat\theta) = N^{-1} ({}_k\eta_L + 2k - L) \tag{4.19}$$

will serve as a useful estimate of $E\, W_2(\vartheta, {}_k\hat\theta)$, at least for the case where $N$ is
sufficiently large and $L$ and $k$ are relatively large integers.
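
The correction term $2k - L$ in (4.19) can be traced by comparing expectations. The
following lines are a sketch of that bookkeeping under the asymptotic chi-square
statements made above; they are added here for illustration and use no further assumption.

```latex
\begin{align*}
E\,[\,N\,W_2(\vartheta, {}_k\hat\theta)\,]
  &\approx N\,\|{}_k\vartheta - \vartheta\|_c^2 + k
  && \text{from (4.11), with } N\,\|{}_k\hat\theta - {}_k\vartheta\|_c^2 \sim \chi^2_k, \\
E\,[\,{}_k\eta_L\,]
  &\approx N\,\|{}_k\vartheta - \vartheta\|_c^2 + (L - k)
  && \text{from (4.15), the inner-product term having mean zero}, \\
E\,[\,N\,W_2(\vartheta, {}_k\hat\theta)\,]
  &\approx E\,[\,{}_k\eta_L\,] + 2k - L .
\end{align*}
```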

It is interesting to note that in practical applications it may sometimes
happen that $L$ is a very large, or conceptually infinite, integer and may not
be defined clearly. Even under such circumstances we can realize our
selection procedure of ${}_k\hat\theta$'s for some limited number of $k$'s, assuming $L$ to be
equal to the largest value of $k$. Since we are only concerned with finding out
the ${}_k\hat\theta$ which will give the minimum of $r(\hat\theta, {}_k\hat\theta)$ we have only to
compute either

$${}_k\nu_L = {}_k\eta_L + 2k \tag{4.20}$$

or

$${}_k\lambda_L = -2 \sum_{i=1}^{N} \log f(x_i \mid {}_k\hat\theta) + 2k, \tag{4.21}$$

and adopt the ${}_k\hat\theta$ which gives the minimum of ${}_k\nu_L$ or ${}_k\lambda_L$
($0 \le k \le L$). The statistical behaviour of ${}_k\lambda_L$ is well understood by taking into
consideration the successive decomposition of the chi-square variables into mutually
independent components. In using ${}_k\lambda_L$ care should be taken not to lose
significant digits during the computation.
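
The selection rule based on (4.20) can be sketched in a few lines. The example assumes,
purely for illustration, independent observations from an $L$-dimensional Gaussian
distribution with unit covariance matrix, where model $k$ leaves the first $k$ means free
and sets the others to zero. Under that assumption the log-likelihood ratio statistic is
$N$ times the sum of the squared sample means of the excluded components. The function
names and the numbers are hypothetical.

```python
def nu_statistics(sample_means, n):
    """Return the list of nu statistics (4.20) for k = 0..L under the assumed Gaussian mean model.

    Model k leaves the first k means free and sets the remaining L - k to zero, so the
    log-likelihood ratio statistic is eta_k = n * (sum of squared sample means for j > k).
    """
    L = len(sample_means)
    return [n * sum(m * m for m in sample_means[k:]) + 2 * k for k in range(L + 1)]

def select_order(sample_means, n):
    """Index k of the smallest nu statistic."""
    nu = nu_statistics(sample_means, n)
    return min(range(len(nu)), key=nu.__getitem__)

sample_means = [1.0, 0.5, 0.05, -0.02]    # hypothetical componentwise sample means
chosen_k = select_order(sample_means, n=100)
```

With these assumed numbers the two small trailing means do not pay for the penalty of 2
per parameter, and the rule stops at the second component.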


5. APPLICATIONS 


Some of the possible applications will be mentioned here. 


1) Factor analysis 


In factor analysis we try to find the best estimate of the variance
covariance matrix $\Sigma$ from the sample variance covariance matrix using the
model $\Sigma = AA' + D$, where $\Sigma$ is a $p \times p$ dimensional matrix, $A$ is a $p \times m$
dimensional ($m < p$) matrix and $D$ is a non-negative $p \times p$ diagonal matrix.
The method of maximum likelihood under the assumption of
normality has been extensively applied and the use of the log-likelihood ratio
criterion is quite common. Thus our present procedure can readily be
incorporated to help the decision of $m$. Some numerical examples are already
given in [6] and the results are quite promising.


2) Principal component analysis


By assuming $D = \delta I$ ($\delta \ge 0$, $I$: unit matrix) in the above model, we can
get the necessary decision procedure for principal component analysis.


3) Analysis of variance 


If in the analysis of variance model we can preassign the order in
decomposing the total variance into chi-square components corresponding to some
factors and interactions, then we can easily apply our present procedure to
decide where to stop the decomposition.


4) Multiple regression 


The situation is the same as in the case of the analysis of variance. We can
make a decision where to stop including the independent variables when the
order of variables for inclusion is predetermined. It can be shown that under
the assumption of normality of the residual variable we have only to
compare the values $s^2(k)\,(1 + 2k/N)$, where $s^2(k)$ is the sample mean square of the
residual after fitting the regression coefficients by the method of least squares,
$k$ is the number of fitted regression coefficients and $N$ the sample size.
$k$ should be kept small compared with $N$. It is interesting to note that the
use of a statistic proposed by Mallows [13] is essentially equivalent to our
present approach.
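
One route from (4.21) to a criterion of this form is sketched below. It is offered as a
plausible reading rather than as the derivation the text has in mind, and it assumes that
$s^2(k)$ is the maximum likelihood estimate of the residual variance (residual sum of
squares divided by $N$) and that the variance parameter is common to all compared models.

```latex
\begin{align*}
-2 \max \log(\text{likelihood}) &= N \log\bigl(2\pi\, s^2(k)\bigr) + N, \\
{}_k\lambda_L &= N \log s^2(k) + 2k + \text{const}, \\
\exp\bigl({}_k\lambda_L / N\bigr) &\propto s^2(k)\, e^{2k/N}
  \approx s^2(k)\Bigl(1 + \frac{2k}{N}\Bigr) \qquad (k \ll N).
\end{align*}
```

Since the exponential is monotone, minimizing ${}_k\lambda_L$ and minimizing
$s^2(k)\,(1 + 2k/N)$ lead to the same choice of $k$ to first order in $k/N$.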


5) Autoregressive model fitting in time series


Though the discussion in the present paper has been limited to the
realizations of independent and identically distributed random variables, by
following the approach of Billingsley [8], we can see that the same line of
discussion can be extended to cover the case of finite parameter Markov
processes. Thus in the case of the fitting of the one-dimensional autoregressive
model $X_n = \sum_{m=1}^{k} a_m X_{n-m} + \varepsilon_n$ we have, assuming the normality of the
process $X_n$, only to adopt the $k$ which gives the minimum of $s^2(k)\,(1 + 2k/N)$ or
equivalently $s^2(k)\,(1 + k/N)(1 - k/N)^{-1}$, where $s^2(k)$ is the sample mean square
of the residual after fitting the $k$th order model by the method of least squares
or some of its equivalents. This last quantity for the decision was
first introduced by the present author and was considered to be an estimate
of the quantity called the final prediction error (FPE) [1, 2]. The use of this
approach for the estimation of power spectra has been discussed and
recognized to be very useful [3]. For the case of the multi-dimensional process
we have to replace $s^2(k)$ by the sample generalized variance or the
determinant of the sample variance-covariance matrix of residuals. The procedure
has been extensively used for the identification of a cement rotary kiln
model [4, 5, 19].

These procedures have been originally derived under the assumption of 
linear process, which is slightly weaker than the assumption of normality, 
and with the intuitive criterion of the expected variance of the final one step 
prediction (FPE). Our present observation shows that these procedures are 
just in accordance with our extended maximum likelihood principle at least 
under the Gaussian assumption. 
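
The order selection by FPE can be sketched as follows. The residual variances are
obtained here with the Levinson-Durbin recursion on autocovariances, which is one of the
usual approximate equivalents of least squares fitting and is an assumption of this
sketch, not a procedure taken from the paper. The function names and the input
autocovariances are hypothetical.

```python
def innovation_variances(acov):
    """Levinson-Durbin recursion: residual variances s2[k] of AR(k) fits, k = 0..len(acov)-1."""
    s2 = [acov[0]]
    coeffs = []                                    # AR coefficients of the current order
    for k in range(1, len(acov)):
        num = acov[k] - sum(coeffs[j] * acov[k - 1 - j] for j in range(k - 1))
        refl = num / s2[-1]                        # reflection (partial autocorrelation) coefficient
        coeffs = [coeffs[j] - refl * coeffs[k - 2 - j] for j in range(k - 1)] + [refl]
        s2.append(s2[-1] * (1.0 - refl * refl))
    return s2

def fpe(s2_k, k, n):
    """FPE estimate s^2(k) (1 + k/N) (1 - k/N)^(-1) as given in the text."""
    return s2_k * (1.0 + k / n) / (1.0 - k / n)

def select_order_fpe(acov, n):
    """Return the order with the smallest FPE together with all FPE values."""
    s2 = innovation_variances(acov)
    values = [fpe(v, k, n) for k, v in enumerate(s2)]
    return min(range(len(values)), key=values.__getitem__), values

# Autocovariances of an AR(1) process with coefficient 0.5, normalized to unit variance.
order, values = select_order_fpe([1.0, 0.5, 0.25, 0.125], n=100)
```

For these exact AR(1) autocovariances the residual variance stops decreasing after the
first order, so the factor $(1 + k/N)(1 - k/N)^{-1}$ makes every higher order look worse.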


6. NUMERICAL EXAMPLES 


To illustrate the difference between the conventional test procedure and
our present procedure, two numerical examples are given using published
data.

The first example is taken from the book by Jenkins and Watts [14]. The
original data are described as observations of yield from 70 consecutive
batches of an industrial process [14, p. 142]. Our estimates of FPE are given
in Table 1 in a relative scale. The results very simply suggest, without the
help of statistical tables, the adoption of $k = 2$ for this case. The same
conclusion has been reached by the authors of the book after a detailed analysis
of significance of partial autocorrelation coefficients and by relying on a
somewhat subjective judgement [14, pp. 199-200]. The fitted model
produced an estimate of the power spectrum which is very much like their final
choice obtained by using a Blackman-Tukey type window [14, p. 292].


Table 1
Autoregressive model fitting


k         0      1      2      3      4      5      6      7

FPE_k*    1.029  0.899  0.895  0.921  0.946  0.970  0.983  1.012


* $FPE_k = s^2(k)\left(1 + \frac{k+1}{N}\right)\left(1 - \frac{k+1}{N}\right)^{-1} \Big/ s^2(0)$
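
The choice $k = 2$ can be read off Table 1 mechanically. The short sketch below uses the
tabulated values only. The last two lines check one reading of the table footnote, namely
that the values are scaled by $s^2(0)$ and use $k + 1$ fitted parameters with $N = 70$; that
reading is an assumption of this sketch.

```python
# FPE values transcribed from Table 1 (relative scale), indexed by k = 0..7.
fpe_table = [1.029, 0.899, 0.895, 0.921, 0.946, 0.970, 0.983, 1.012]

# Order with the smallest tabulated FPE.
best_k = min(range(len(fpe_table)), key=fpe_table.__getitem__)

# Assumed normalization: for k = 0 the ratio s^2(0)/s^2(0) is 1, so the entry would be
# (1 + 1/N) / (1 - 1/N) with N = 70 batches.
n = 70
fpe_0_check = round((1 + 1 / n) / (1 - 1 / n), 3)
```

Under the assumed normalization the computed value for $k = 0$ agrees with the first
entry of the table.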


The next example is taken from a paper by Whittle on the analysis of a
seiche record (oscillation of water level in a rock channel) [26; 27, pp. 37-38].
For this example Whittle used the log-likelihood ratio test
statistic in successively deciding the significance of increasing the order by
one and adopted $k = 4$. He reports that the fitting of the power spectrum
is very poor. Our procedure applied to the reported sample autocorrelation
coefficients obtained from data with $N = 660$ produced a result showing
that $k = 65$ should be adopted within the $k$'s in the range $0 \le k \le 66$.
The estimates of the power spectrum are illustrated in Fig. 1. Our procedure
suggests that $L = 66$ is not large enough, yet it produced very sharp
line-like spectra at various frequencies as was expected from the physical
consideration, while the fourth order model did not give any indication of
them. This example dramatically illustrates the impracticality of the
conventional successive test procedure depending on a subjectively chosen set
of levels of significance.


Extension of the maximum likelihood principle 279


[Figure 1: graphic not recoverable; surviving legend text: N = 660]

Fig. 1. Estimates of the seiche spectrum.
The smoothed periodogram of $x(n\Delta t)$ ($n = 1, 2, \ldots, N$) is defined by

$$\Delta t \sum_{s=-l}^{l} \left(1 - \frac{|s|}{l}\right) C_{xx}(s) \cos(2\pi f s\, \Delta t),$$

where $l$ = max. lag, $C_{xx}(s) = \frac{1}{N} \sum_{n=1}^{N-|s|} \tilde{x}(|s| + n)\, \tilde{x}(n)$,
where $\tilde{x}(n) = x(n\Delta t) - \bar{x}$ and $\bar{x}$ is the sample mean of $x(n\Delta t)$.


7. CONCLUDING REMARKS 


In spite of the early statement by Wiener [28; p. 76] that entropy, the
Shannon-Wiener type definition of the amount of information, could
replace Fisher's definition [11], the use of information theoretic concepts
in statistical circles has been quite limited [10, 12, 20]. The distinction
between Shannon-Wiener's entropy and Fisher's information was discussed
as early as 1950 by Bartlett [7], where the use of the Kullback-Leibler
type definition of information was implicit. Since then, in the theory of
statistics, Kullback-Leibler's or Fisher's information has not enjoyed the
prominent status of Shannon's entropy in communication theory, which proved
its essential meaning through the source coding theorem [22, p. 28].

The analysis in the present paper shows that the information theoretic
consideration can provide a foundation of the classical maximum
likelihood principle and greatly widen its practical applicability. This shows
that the notion of information, which is more closely related to the mutual
information in communication theory than to the entropy, will play the most
fundamental role in the future developments of statistical theories and
techniques.

By our present principle, the extensions of applications 3) to 5) of Section
5 to include the comparisons of every possible $k$th order model are
straightforward. The analysis of the overall statistical characteristics of such
extensions will be a subject of further study.